arXiv:1507.02987v1 [cs.DC] 10 Jul 2015


GENOOGLE: 

AN INDEXED AND PARALLELIZED SEARCH ENGINE FOR 
SIMILAR DNA SEQUENCES 


FELIPE FERNANDES ALBRECHT 


Abstract. The search for similar genetic sequences is one of the main tasks in
bioinformatics. Genetic sequence data banks are growing exponentially, and
search techniques that run in linear time can no longer complete a search in
the required time. A further problem is that processor clock speeds are no
longer increasing as they once did; instead, processing capacity grows through
the addition of more cores, and techniques that do not use parallel computing
gain nothing from these extra cores. This work combines data indexing
techniques, which reduce the computational cost of the search, with
parallelization of the search, which exploits the capacity of multi-core
processors. To verify the viability of using the two techniques together,
software that combines parallelization techniques with inverted indexes was
developed.

Experiments were run to analyze the performance gain from parallelism, the
reduction in search time, and the quality of the results compared with other
search tools. The results were promising: the parallelism gain exceeded the
expected speedup, searches were 20 times faster than the parallelized NCBI
BLAST, and the results were of good quality when compared with this tool.

The software source code is available at https://github.com/felipealbrecht/Genoogle 


1. Introduction 

One of the most important tasks in bioinformatics is the search for similar
genetic sequences in data banks. With new sequencing technologies, the size of
these genetic data banks is growing exponentially [4], and the search time is
growing with it.

The alignment algorithms Needleman-Wunsch [21] and Smith-Waterman [24] belong
to the dynamic programming class [6]. They are very sensitive alignment
algorithms, but their computation and memory costs are quadratic, O(mn), where
m is the input sequence length and n the data bank length. These costs make
them impractical for large data banks. To address this problem, heuristics
were developed to reduce the memory and processing consumption of similar
sequence searches. Among the algorithms that use heuristics, FASTA [23] and
BLAST are the most widely used and best known. These algorithms search for
areas of similarity, called HSPs (High Scoring Pairs), and then align the best
HSPs found using a dynamic programming algorithm. FASTA and BLAST also reduce
the alignment cost, because only regions with a previously detected similarity
are aligned. Nevertheless, the complexity of the problem remains O(nmq), where
q is the number of sequences in the data bank, because the data bank sequences
must still be read in full to find the HSPs.

The search can be optimized by using an inverted index to locate HSPs.
Inverted indexes are data structures that find the location of indexed data in
constant time, O(1). This type of index is used in Information Retrieval,
especially in web search engines such as Google and Yahoo, and in data
indexing libraries such as Apache Lucene [9].

In the search for similar genetic sequences, the sub-sequences of the data
bank sequences are indexed, eliminating the need to find HSPs through a linear
search. This is quite similar to the index at the end of a book, except that
instead of text entries it has sub-sequence entries stating where each
sub-sequence can be found. The two data structures most commonly used to index
genetic sequence data banks are suffix trees and vectors. Suffix trees are
used in the work of Gusfield; this data structure gives access to a
sub-sequence position in linear time, but its real strength is finding
repeated sequence regions and obtaining the longest common ancestor. One
application of suffix trees builds a tree in order to find near-exact
sequences. In that example, the gain in access time comes at a high cost in
memory consumption. According to [22], the implementation of Delcher et al.
[7] spends 37 bytes per base when using suffix trees. For comparison, the
human genome, with approximately three billion base pairs, would require more
than 103 gigabytes to store the suffix tree. An optimization of the previous
work is shown in [8], where the algorithm is three times faster and uses one
third of the memory. Even with this reduction, more than 34 gigabytes are
needed to store a suffix tree that indexes the human genome.

A vector is a data structure that is essentially an array of elements, where
each position of the array represents a piece of information and holds another
array stating where that information can be found. Some similar sequence
search techniques that use vectors as inverted indexes are SSAHA [22], BLAT,
PatternHunter, miBLAST [10], MegaBLAST, Indexed MegaBLAST [18], and Kalafus,
which uses hash tables to align whole genomes. Transform-based methods are
considered outside the scope of this work, but Jing presents methods that use
this technique.

Another way to reduce the search time is parallel computing. The importance of
parallelization has grown along with the multiprocessing capabilities of
modern processors, such as the number of cores. Some similar sequence search
tools have ways to use these multiprocessing technologies. For example, NCBI
BLAST can be told to divide the data bank and perform parallel searches on the
fragments. With the data bank divided and the search parallelized, the
computational complexity is O(nmq/p), where p is the number of fragments into
which the data bank is divided; that is, even with parallelization, the search
still has linear computational complexity.

The computational complexity can be reduced by using an inverted index, but a
search of the literature found no technique that combines indexing with
parallel programming, so existing indexed techniques do not use the capacity
of modern multi-core processors. The objective of this work is therefore to
use indexing techniques to optimize the search, together with multi-core
processors, to reduce the search time for similar genetic sequences. The
software developed, called Genoogle, uses inverted indexes, parallel index
searches, data bank division and parallelization, bit-level sequence encoding,
and optimized alignment algorithms. It was developed in the Java environment
and has web page, web services, and text mode interfaces.


2. Methods 

This work presents Genoogle, a similar sequence search engine whose objective
is to run fast searches with good sensitivity. To achieve these goals, it uses
an inverted index to find HSPs quickly, and it uses parallelization techniques
to exploit multi-core processors. Genoogle works similarly to BLAST: given an
input sequence, a parameter set, and a data bank to search, the software
returns the sequences similar to the input under the given parameters.

A genetic sequence is defined as a sequence over the alphabet S = {A, C, G, T}
(DNA) or S = {A, C, G, U} (RNA), and a sub-sequence is a sequence contained
partially or fully in another sequence. Sequences in Genoogle are divided into
fixed-length sub-sequences, whose length is defined by the user before the
data bank is formatted, and are then encoded using 2 bits per base. Each
sub-sequence is stored as a bit vector in a 32-bit integer, so the length can
vary from 1 to 16 bases. This value affects search speed and sensitivity: the
larger the value, the faster the search, but the lower the sensitivity. To
save memory, non-overlapping windows are used to encode the data bank
sequences; for input sequences, overlapping windows are used to gain
sensitivity during the search. The encoded sequences, together with other
sequence information (name, identification code, and description), are stored
in a file on disk, which constitutes the data bank.
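
The encoding and windowing described above can be sketched as follows. This is
a minimal illustration in Python, not Genoogle's Java code. The bit assignment
(A=00, C=01, G=10, T=11) and the function names are assumptions; the bit
assignment was chosen because it reproduces the example in Figure 3, where
CGTCGA is encoded as 011011011000 (1752).

```python
# Assumed 2-bit code per base; it reproduces the Figure 3 example.
BASE_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode(subseq):
    """Pack a sub-sequence of 1 to 16 bases into one integer, 2 bits per base."""
    if not 1 <= len(subseq) <= 16:
        raise ValueError("sub-sequence length must be between 1 and 16 bases")
    value = 0
    for base in subseq:
        value = (value << 2) | BASE_BITS[base]
    return value


def windows(sequence, length, overlapping):
    """Split a sequence into fixed-length sub-sequences.

    Non-overlapping windows (step = length) model the data bank encoding;
    overlapping windows (step = 1) model the input sequence encoding.
    A trailing piece shorter than `length` is dropped in this sketch.
    """
    step = 1 if overlapping else length
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, step)]


print(encode("CGTCGA"))                               # 1752
print(windows("ACGTACGT", 4, overlapping=False))      # ['ACGT', 'ACGT']
print(len(windows("ACGTACGT", 4, overlapping=True)))  # 5
```

Because 16 bases at 2 bits each fill exactly 32 bits, the upper limit of 16
bases follows directly from the choice of a 32-bit integer.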

Masks are applied to the indexed sub-sequences to improve the search
sensitivity of the inverted index. The masks are based on the PatternHunter
work. A mask states which bases of a sub-sequence are kept and which are
removed, and consequently increases the probability of finding a sub-sequence
in the index. Masks provide two gains: sensitivity, because the index can be
searched with non-exact sequences, and index space, because longer
sub-sequences are transformed into shorter ones, leaving fewer index entries
and fewer sub-sequences in the data bank.

Genoogle has run-time parameters that can improve either the sensitivity or
the search performance. The sensitivity parameters are the maximum distance
between index entries for them to be considered part of the same HSP, the
minimum HSP size, and the drop-off for sequence extension. Changing these
parameters affects sensitivity and allows more HSPs to be found, but more
false positives and slower searches are to be expected.

2.1. Inverted Indexes. Genoogle uses vectors as the inverted index data
structure. The size of the main vector is the number of possible
sub-sequences: since the DNA alphabet has 4 letters, there are 4^n possible
sub-sequences, where n is the defined sub-sequence length, so the inverted
index has 4^n entries. The size of each sub-sequence vector varies according
to the number of occurrences of that sub-sequence in the data bank. Each
occurrence uses two 4-byte integers: one stores the identifier of the sequence
and the other stores the position within the sequence. This makes it possible
to index approximately 4.29 billion (2^32) sequences, each of up to that
length, so available memory is the main limit on the number of sequences and
their size.
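
The structure described above can be sketched as a list with 4^n entries, each
holding (sequence identifier, position) pairs. This is an illustrative Python
sketch under assumptions: the names `encode` and `build_index` are
hypothetical, the bit assignment A=00, C=01, G=10, T=11 is assumed, and the
index is built directly in memory rather than with the sort-based method that
Genoogle uses.

```python
BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code


def encode(subseq):
    """Encode a sub-sequence as an integer, 2 bits per base."""
    value = 0
    for base in subseq:
        value = (value << 2) | BASE_BITS[base]
    return value


def build_index(sequences, n):
    """Build an inverted index over non-overlapping sub-sequences of n bases.

    `sequences` maps a sequence identifier to its bases. The result is a list
    with 4**n entries; entry k holds (sequence id, position) pairs for the
    sub-sequence whose encoded value is k.
    """
    index = [[] for _ in range(4 ** n)]
    for seq_id, bases in sequences.items():
        for pos in range(0, len(bases) - n + 1, n):
            index[encode(bases[pos:pos + n])].append((seq_id, pos))
    return index


index = build_index({10: "ACGTACGA", 44: "ACGTTTTT"}, n=4)
print(len(index))             # 256
print(index[encode("ACGT")])  # [(10, 0), (44, 0)]
```

Looking up a sub-sequence is then a single list access by its encoded value,
which is what gives the constant-time lookup.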


[Figure: the inverted index lists every possible sub-sequence (AAAAAAAA,
AAAAAAAC, ..., TTTTTTTT); each entry points to a sub-sequence localization
vector such as [<10,22>, <44,1>], whose elements hold a sequence identifier
and a sequence position.]

Figure 1. Sub-sequences inverted index structure.

The masks are applied to the data bank sequences during the indexing process.
The sequences are read and divided into sub-sequences, the mask is applied to
each sub-sequence, and the masked sub-sequences are then indexed. Masks allow
longer sub-sequences with a smaller loss of sensitivity. For example, in the
mask 111010010100110111, a 1 means that the base at that position is preserved
and a 0 means that it is removed; the sub-sequences read from the data bank
have 18 bases and, after the mask is applied, 11 bases. In this way, the
vector indexes 11-base sub-sequences while the sequences are divided into
18-base sub-sequences, which saves memory in the inverted index structure.
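
As a small illustration of the mask from the example above, the following
Python sketch keeps the bases at positions marked 1 and drops the rest. The
function name `apply_mask` and the sample windows are hypothetical.

```python
MASK = "111010010100110111"  # the example mask from the text: 18 long, 11 ones


def apply_mask(subseq, mask=MASK):
    """Keep the bases where the mask has '1' and drop those where it has '0'."""
    if len(subseq) != len(mask):
        raise ValueError("sub-sequence and mask must have the same length")
    return "".join(base for base, keep in zip(subseq, mask) if keep == "1")


window = "ACGTACGTACGTACGTAC"   # 18 bases read from the data bank
variant = "ACGGACGTACGTACGTAC"  # differs only at a position the mask drops

print(len(window), len(apply_mask(window)))       # 18 11
print(apply_mask(window) == apply_mask(variant))  # True
```

The second print shows why masks improve sensitivity: two windows that differ
only at masked-out positions map to the same index entry.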

Genoogle needs approximately ((l/m) * 8) + (4^s * 16) bytes to store the
inverted index, where l is the length of the data bank in bases, m the total
mask length, and s the sub-sequence length. For a data bank with 4 billion
bases and sub-sequences of 11 bases, there are 363 million sub-sequences,
requiring 2,833 megabytes to store the inverted index. With masks 18 bases
long, there are 222 million sub-sequences and the inverted index needs 1,759
megabytes, a saving of 30% of the total memory required. The inverted indexes
are built with the sort-based method [26], and the inverted index and the
formatted data bank are stored on disk. When Genoogle starts, the whole
inverted index is read and loaded into main memory; the data bank
meta-information, such as file offsets, is also loaded into main memory, but
the data bank sequences are read from disk only when the sequence information
is needed.
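
As a rough check of the figures above, the sketch below assumes that the
estimate is 8 bytes per indexed occurrence plus 16 bytes per index entry, that
is (l/m) * 8 + 4^s * 16 bytes, and that a megabyte is 2^20 bytes. Both the
exact form of the expression and the unit are assumptions inferred from the
numbers quoted in the text; the function name is hypothetical.

```python
def index_memory_mib(l, m, s):
    """Estimated inverted index size in units of 2**20 bytes.

    l: data bank length in bases
    m: window length read from the data bank (the mask length, or s if unmasked)
    s: length of the indexed sub-sequence
    Assumes 8 bytes per occurrence and 16 bytes per index entry.
    """
    occurrences = l // m
    return (occurrences * 8 + 4 ** s * 16) / 2 ** 20


plain = index_memory_mib(4_000_000_000, 11, 11)
masked = index_memory_mib(4_000_000_000, 18, 11)
print(f"without mask: {plain:.0f}, with 18-base mask: {masked:.0f}")
```

Under these assumptions the estimate comes out within a few megabytes of the
2,833 and 1,759 megabytes quoted above; the small difference for the unmasked
case comes from rounding the number of sub-sequences to 363 million.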

2.2. Searching process. After the inverted index and the data bank
meta-information are loaded into main memory, Genoogle is ready to run
searches. The searching process is divided into 7 phases:

• Input sequence processing; 

• Index searching for similar sub-sequences and construction of the HSPs; 

• HSPs extension and merge; 

• Merging overlapped HSPs; 

• Selection of the high scored HSPs; 

• Local alignment of the selected HSPs; 

• Selection and exhibition of the best alignments. 

Input sequence processing first applies the mask to each overlapping
sub-sequence of the input sequence and encodes the resulting sub-sequence in
the binary representation used by Genoogle. This processing is shown in
Fig. 2. Because the input sub-sequences are encoded as binary data in an
integer, the sub-sequence value can be obtained directly from this data. Since
the position of a sub-sequence in the index is its own value, the index search
becomes simple and direct.



Figure 2. Input sequence processing. 

Fig. 3 shows the process of retrieving information from the inverted index.
For each masked and encoded sub-sequence of the input sequence, its encoded
value is used to retrieve from the inverted index all places where this
sub-sequence occurs in the data bank sequences. The retrieved entries are
stored in an array of arrays, where each position represents a data bank
sequence. If two or more retrieved entries are closer than a specified
parameter, they are merged into one. The entries are then filtered by length,
and the remaining ones are called High Scoring Pairs (HSPs). An HSP holds five
pieces of information: the initial and final positions in the input sequence,
the initial and final positions in the data bank sequence, and the length of
the area, which is the smaller of its lengths in the data bank sequence and in
the input sequence.

[Figure: the masked sub-sequence CGTCGA is encoded as 011011011000, which
gives inverted index position 1752; the entries stored at that position,
<312,32> and <532,0>, are returned as the lookup results.]

Figure 3. Obtaining sub-sequence localization data from the inverted index.
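
The merging and filtering of index hits into HSPs can be sketched as below for
the hits of a single data bank sequence. This is a simplified Python
illustration, not Genoogle's implementation: the function name, the parameter
names, and the exact merge rule (a hit is merged only into the most recent
area, when the gap on both sequences is at most `max_distance`) are
assumptions.

```python
def build_hsps(hits, sub_len, max_distance, min_length):
    """Merge nearby index hits of one data bank sequence into HSPs.

    hits: (input position, data bank position) pairs, each covering sub_len
    bases. Returns (input start, input end, db start, db end, length) tuples,
    where length is the smaller of the two spans.
    """
    areas = []
    for q, d in sorted(hits):
        if areas:
            q0, q1, d0, d1 = areas[-1]
            # Merge when the new hit starts close after the current area
            # on both the input sequence and the data bank sequence.
            if 0 <= q - q1 <= max_distance and 0 <= d - d1 <= max_distance:
                areas[-1] = (q0, q + sub_len, d0, d + sub_len)
                continue
        areas.append((q, q + sub_len, d, d + sub_len))

    hsps = []
    for q0, q1, d0, d1 in areas:
        length = min(q1 - q0, d1 - d0)
        if length >= min_length:  # filter out areas that are too short
            hsps.append((q0, q1, d0, d1, length))
    return hsps


hits = [(5, 0), (13, 8), (21, 16), (100, 300)]
print(build_hsps(hits, sub_len=8, max_distance=4, min_length=16))
# [(5, 29, 0, 24, 24)]
```

In the usage example, three consecutive hits are merged into one 24-base HSP,
while the isolated hit at input position 100 is discarded by the length
filter.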


After the index searching phase, the located HSPs are extended in both
directions to try to increase their length. During the extension phase, two or
more HSPs that were close to each other can come to overlap, generating
duplicate results. Therefore, after the extension phase the HSPs are checked
for overlapping regions, and those that overlap are merged into a new HSP.
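
A minimal sketch of the overlap merge, assuming that an HSP is represented as
(input start, input end, data bank start, data bank end) and that two HSPs are
merged when they overlap on both sequences. The representation, the function
name, and the rule of comparing only against the most recently merged HSP are
simplifications for illustration.

```python
def merge_overlapping(hsps):
    """Merge HSPs that overlap on both the input and the data bank sequence."""
    merged = []
    for q0, q1, d0, d1 in sorted(hsps):
        if merged:
            mq0, mq1, md0, md1 = merged[-1]
            overlaps_input = q0 <= mq1
            overlaps_db = d0 <= md1 and d1 >= md0
            if overlaps_input and overlaps_db:
                # Replace the last HSP with the union of both regions.
                merged[-1] = (mq0, max(mq1, q1), min(md0, d0), max(md1, d1))
                continue
        merged.append((q0, q1, d0, d1))
    return merged


extended = [(0, 30, 100, 130), (25, 60, 125, 160), (200, 220, 10, 30)]
print(merge_overlapping(extended))
# [(0, 60, 100, 160), (200, 220, 10, 30)]
```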

After the extension and merging phase, the HSPs are sorted by length in
decreasing order. A parameter can specify the maximum number of alignments to
return. Its purpose is to reduce the number of alignments built, saving
processing and returning only the most significant alignments; it is also
possible to return all HSPs found.

After the selection, a local alignment is built for each selected HSP using a
modified version of the Smith-Waterman algorithm. The algorithm is modified to
limit the distance from the main diagonal used to produce the alignment:
rather than using the whole matrix, only the cells closest to the main
diagonal are used. This limit is applied because the two sub-sequences to be
aligned are highly similar, which makes it unnecessary to compute the whole
alignment matrix. To reduce memory use, the alignment matrix is also divided
into smaller matrices. This technique divides the two sequences into segments
and aligns each segment separately; the segment results are then merged to
form the final alignment of the whole sequences. After the HSPs are aligned,
the alignments are sorted by score and returned to the user together with
other information, such as the E-value, score, normalized score, and alignment
position in the input and data bank sequences.
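
The idea of restricting Smith-Waterman to the cells near the main diagonal can
be sketched as follows. This Python sketch computes only the best local
alignment score, without traceback and without the segment division described
above; the scoring values, the `band` parameter, and the function name are
assumptions, not Genoogle's actual settings.

```python
def banded_local_score(a, b, band, match=1, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with |i - j| <= band."""
    best = 0
    prev = {}  # scores of row i - 1, keyed by column j
    for i in range(1, len(a) + 1):
        cur = {}
        lo = max(1, i - band)
        hi = min(len(b), i + band)
        for j in range(lo, hi + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Cells outside the band are treated as 0, like the matrix border.
            score = max(
                0,
                prev.get(j - 1, 0) + sub,  # diagonal: match or mismatch
                prev.get(j, 0) + gap,      # gap in b
                cur.get(j - 1, 0) + gap,   # gap in a
            )
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best


print(banded_local_score("ACGTACGT", "ACGTACGT", band=2))  # 8
print(banded_local_score("AAAAA", "AATAA", band=0))        # 3
```

Only about (2 * band + 1) cells are computed per row instead of a full row, so
the cost drops from quadratic to roughly linear in the sequence length for a
fixed band, at the price of missing alignments that stray far from the
diagonal.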

3. Parallelization Methods 

To improve the searching time and to use the multi-processing capabilities, 
Genoogle uses three parallelization techniques: inverted index access parallelization, 
extension and alignment parallelization, and data bank division parallelization. 

Data bank division parallelization divides the data bank into fragments,
similarly to NCBI BLAST. The index searches and HSP construction run
independently in one thread per data bank fragment. After the individual
threads locate the HSPs, these are extended, merged, sorted, filtered,
aligned, sorted by score, and returned to the user. This technique is useful
for large data banks, but it does not parallelize the whole searching process:
the extension, sorting, and alignment phases are not parallelized by it.

An analysis of the data bank fragmentation process showed that the searching
threads do not all have the same execution time, because some fragments
contain more similar sequences than others. An analysis of the search time
showed that index searching accounts for about 60% of the whole search time
and HSP alignment for about 30%. For this reason, a component was developed to
parallelize sequence extension, sorting, and alignment using all available
processing cores.

In this parallelization technique, the threads search for the input
sub-sequences in the index of each data bank fragment and put the HSPs into a
collection shared by all index searching threads. After the index searching
phase, the HSP collection is sorted by HSP length and the longest HSPs are
selected, according to a parameter stating how many alignments should be
returned. The HSPs that remain after this selection are put into a FIFO queue
served by extenders and aligners, which perform the extension and alignment.
The extenders and aligners run in independent threads: each reads an HSP from
the queue, performs the extension and alignment, and puts the result into
another shared collection. The threads are managed by an executor; when a
thread finishes its job, the executor sends it another HSP to extend and
align, until all HSPs are processed. The number of simultaneous threads is
configurable.
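
The executor pattern described above can be sketched with Python's
`concurrent.futures.ThreadPoolExecutor`. This only illustrates the structure:
`extend_and_align` is a hypothetical placeholder, an HSP is reduced to a
(start, end) pair, and in CPython threads do not speed up CPU-bound work the
way Java threads do.

```python
from concurrent.futures import ThreadPoolExecutor


def extend_and_align(hsp):
    """Placeholder for the extension and alignment of one HSP.

    It returns the HSP with a stand-in score (its length) instead of a
    real alignment.
    """
    start, end = hsp
    return {"hsp": hsp, "score": end - start}


def process_hsps(hsps, max_alignments, threads):
    """Select the longest HSPs and extend/align them on a pool of threads."""
    by_length = sorted(hsps, key=lambda h: h[1] - h[0], reverse=True)
    selected = by_length[:max_alignments]
    # The executor hands the next HSP to whichever worker becomes free.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(extend_and_align, selected))
    return sorted(results, key=lambda r: r["score"], reverse=True)


print(process_hsps([(0, 10), (5, 50), (3, 8)], max_alignments=2, threads=4))
```

Handing work to idle workers one HSP at a time balances the load even when
individual alignments take very different amounts of time.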

The memory required by the inverted index structure is the main problem of
data bank division. For example, with sub-sequences 11 base pairs long, 32
megabytes are needed to store the inverted index structure alone. In theory,
to use the whole computational capacity of a computer with 8 processing cores,
the data bank must be divided into 8 parts, which requires up to 512 megabytes
for the inverted index structures alone. Because of this memory requirement, a
complementary approach is important. Along with the data bank division, the
input sequence is also divided and its parts are searched in parallel. This
makes it possible to parallelize the search without loading more data
structures into memory.

This parallelization divides the input sequence into sub-inputs and searches
each sub-input in the inverted index independently. It also parallelizes the
processing of the input query. After the index search, HSPs that come from the
same data bank sequence and are close to each other are merged into one HSP.
With the data bank divided into two fragments and the input sequence into two
sub-inputs, 4 threads search the inverted index, while the memory overhead is
only two inverted index structures. Fig. 4 shows the complete parallelized
Genoogle search, with the input query divided into two sub-inputs and the data
bank into two fragments.
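
The combination of data bank fragments and sub-inputs can be sketched as one
task per (fragment, sub-input) pair. In this Python illustration each fragment
index is a plain dictionary keyed by sub-sequence text, and the sub-inputs
overlap by n - 1 bases so that no window is lost at a boundary; these choices
and all names are assumptions made for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def split(seq, parts, overlap):
    """Split a non-empty seq into `parts` pieces, returned with their offsets."""
    size = -(-len(seq) // parts)  # ceiling division
    return [(start, seq[start:start + size + overlap])
            for start in range(0, len(seq), size)]


def search(fragment_index, offset, sub_input, n):
    """Look up every overlapping n-base window of sub_input in one fragment."""
    hits = []
    for i in range(len(sub_input) - n + 1):
        for seq_id, pos in fragment_index.get(sub_input[i:i + n], []):
            hits.append((offset + i, seq_id, pos))
    return hits


def parallel_search(fragment_indexes, query, sub_inputs, n):
    """Run one index search per (fragment, sub-input) pair and gather hits."""
    tasks = list(product(fragment_indexes, split(query, sub_inputs, n - 1)))
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(search, index, offset, sub, n)
                   for index, (offset, sub) in tasks]
        return sorted(hit for f in futures for hit in f.result())


fragments = [{"ACGT": [(1, 0)]}, {"CGTA": [(2, 4)]}]
print(parallel_search(fragments, "ACGTACGTAC", sub_inputs=2, n=4))
# [(0, 1, 0), (1, 2, 4), (4, 1, 0), (5, 2, 4)]
```

With two fragments and two sub-inputs the sketch creates four tasks while
holding only two index structures, which mirrors the memory argument made
above.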


4. Implementation 

Genoogle was developed using the Java Environment version 1.6. The Java
Environment was chosen because it is multi-platform and has a framework and
primitives for parallel computing. The BioJava library [12] was used during
the first part of the development but has since been removed from the project.
In the first implementations, BioJava was used for reading, parsing, storing,
and aligning genetic sequences, but its reading and parsing methods were found
to require too much memory, so new, optimized classes were developed to
perform these tasks.

Figure 4. Obtaining sub-sequence localization data from the inverted index.

The main user interface is a text mode interface, where the user types the
search command and the results are stored in an XML file chosen by the user.
This interface has commands to perform a search, list the available data
banks, obtain the parameter list, run the garbage collection, repeat the last
command executed, and run a batch file of commands. The batch command is
useful because all the commands to be executed can be written to a file and
Genoogle can run them without any user intervention.

Genoogle also has a simple embedded web page interface, best suited for
testing, which contains only an input field for the input sequence and a
button to perform the search. The search results of the web page interface are
an XML document, formatted and shown as an HTML page by an XSL document.
Together with the web page, Genoogle has a web services interface, implemented
using JAX-WS. Through the web services interface it is possible to execute
queries, set parameters, retrieve the list of available data banks, and
perform other tasks from within a script. Users can write scripts in their
preferred programming language to access Genoogle services automatically and
perform their searches without manual intervention.


5. Results 

The experiments used a data bank with sequences from phase 3 of the human
genome project along with RefSeq data banks. The RefSeq data banks were used
because they are verified and of high quality. The human genome project data
bank is hs_phase3, with approximately 3.8 Gb. The RefSeq data banks are cow
with approximately 57 Mb, frog with 20 Mb, human with 112 Mb, mouse with
93 Mb, rat with 73 Mb, and zebrafish with 62 Mb, totaling approximately
417 Mb. The experiments used the RefSeq data banks together with hs_phase3,
totaling 4.25 Gb and requiring approximately 2 gigabytes of memory.

For the experiments, 11 sets of input sequences were generated. Each set has
11 sequences of approximately the same size: 1 sequence taken from the data
bank and 10 mutations of that sequence. The sets contain sequences of 80 base
pairs (bp), 200 bp, 500 bp, 1,000 bp, 5,000 bp, 10,000 bp, 50,000 bp,
100,000 bp, 500,000 bp, and 1,000,000 bp. The searches used data bank
parallelism, input sequence division, and simultaneous extension and alignment
of the sequences.

The experiments were run on a computer with 16 gigabytes of RAM, Linux version
2.6.18, and Java Environment version 1.6 with the JRockit JVM version 3.0.3.



Figure 5. Input sequence speedup relation. 


For inputs of up to 5,000 bp the gain from parallelism is low, only 2 times,
and the total time increases when the input sequence is divided into more than
4 parts. This happens because the search time for these small inputs is very
low, less than 200 ms, and the synchronization overhead directly affects the
search time. For input sequences of 10,000 bp there is a gain of 5 times with
the parallelization techniques. For inputs of 50,000 bp and more the speedup
is 8, which is the target gain, and for inputs of 500,000 bp and 1,000,000 bp
the gains exceed this speedup.

Tools based on suffix trees were excluded from the time comparison because of
the memory they require. Tests showed that the BLAT software cannot handle
data banks larger than 4 Gb. MegaBLAST was not evaluated because it is
reported to have been developed for fast search times, with result quality
worse than that of NCBI BLAST, since its minimum seed size is 28 bp. Because
of the lower result quality, no experiments were run with this tool. Indexed
MegaBLAST [18] could not be run because it requires memory equal to 4 times
the data bank size, more than the 16 Gb available. Finally, it was not
possible to obtain PatternHunter to run the experiments. It was therefore
decided to compare performance only against NCBI BLAST.

Comparing the sequential search times of Genoogle and BLAST shows that
Genoogle is almost 20 times faster; comparing the parallel times, Genoogle is
about 26.6 times faster. For smaller inputs, up to 5,000 bp, the time gains
over BLAST are smaller because the parallelization techniques do not reach
their full potential on such small inputs. For larger inputs, the difference
reaches 29 times for the 100,000 bp input. It is worth noting that the
sequential version of Genoogle is faster than the parallel execution of BLAST.


Base pairs   BLAST (ms)   Genoogle (ms)   Gain (times)
80                5,572             150          37.00
200               8,882             460          19.30
500              14,488             340          42.61
1,000            19,087             570          33.48
5,000            58,902           2,400          24.54
10,000           98,160           5,318          18.45
50,000          604,785          31,499          19.20
100,000       1,973,333          75,610          26.09
500,000       7,700,571         393,450          19.57
1,000,000     1,229,988          76,909          16.00
Total        11,713,768         586,706          19.96

Table 1. Time comparison between sequential NCBI BLAST and 
sequential Genoogle. 


Base pairs   BLAST (ms)   Genoogle (ms)   Gain (times)
80                1,061             150           7.00
200               2,145             270           7.94
500               3,170             210          15.09
1,000             2,853             270          10.56
5,000            10,387           1,341           7.74
10,000           13,027           1,050          12.40
50,000           78,067           4,440          17.58
100,000         276,779           9,380          29.50
500,000       1,206,212          45,120          26.73
1,000,000       193,090           8,780          22.00
Total         1,786,791          67,011          26.66

Table 2. Time comparison between parallel NCBI BLAST and 
parallel Genoogle. 
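
The overall gains can be recomputed from the totals in Tables 1 and 2. The
short Python sketch below only redoes that arithmetic; the variable names are
illustrative.

```python
# Total search times in milliseconds, taken from Tables 1 and 2.
blast_seq, genoogle_seq = 11_713_768, 586_706
blast_par, genoogle_par = 1_786_791, 67_011

seq_gain = blast_seq / genoogle_seq      # sequential BLAST vs sequential Genoogle
par_gain = blast_par / genoogle_par      # parallel BLAST vs parallel Genoogle
cross_gain = blast_par / genoogle_seq    # parallel BLAST vs sequential Genoogle

print(f"sequential gain: {seq_gain:.2f}")
print(f"parallel gain: {par_gain:.2f}")
print(f"parallel BLAST / sequential Genoogle: {cross_gain:.2f}")
```

The last ratio is about 3, which is the basis for the observation that
sequential Genoogle is faster than parallel BLAST on this workload.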


5.1. Results quality. The quality of the results was analyzed by comparing the
Genoogle results with the BLAST results, checking which HSPs were identified
as similar and what percentage of the HSPs identified by BLAST were not
identified by Genoogle. For each input sequence, a collection of the
alignments found by BLAST was created, and each of these alignments was
checked against those found by Genoogle. The alignments found were counted and
a percentage was generated for each E-value range (from 10e-… to 10e…).

The following graph shows the proportion of the alignments found by BLAST that
were also found by Genoogle, according to the alignment E-value. Up to E-value
10e-… more than 90% of the alignments were found by Genoogle. Up to E-value
10e-15 more than 60% of the alignments were found, and with E-value … almost
55% of the alignments were found. Above this E-value the proportion of
alignments found is below 40%.



Figure 6. Proportion of alignments found by Genoogle in relation 
to BLAST. 

This graph shows that Genoogle is a good tool for finding alignments whose
E-value is lower than 10e-…, because such alignments are usually long, with
more than 100 bp, and are consequently found easily. From E-value 10e-…
onward there is a drop in result quality, which stabilizes close to E-value 1.
Alignments with an E-value higher than 0.005 do not allow a close homology to
be inferred [3]. Thus, Genoogle has very good search quality up to a maximum
E-value of 10e-…, with a drop in quality up to 10e-…; for values higher than
10e-…, alignments should not be considered for homology inference.

The experiments showed that the quality of Genoogle's results is comparable to
BLAST's when the alignment E-value is significant, that is, when it indicates
a possible homology between the two aligned sequences. More sensitive searches
can be achieved by changing the search parameters, such as the maximum
distance between the sub-sequence entries retrieved from the index and the
minimum HSP length. When the minimum HSP length was changed, the proportion of
HSPs found for alignments with E-value higher than 1e-… grew to approximately
80%, while the search time rose by only 3%.


6. Conclusion 

This work presented a similar genetic sequence search tool that uses data bank
sequence indexing together with parallel computing. To assess the
effectiveness and search quality of combining indexing and parallel computing,
the Genoogle tool was developed and implemented. The software was implemented
in Java 1.6 and runs on Windows, Linux, and Mac environments. Experiments were
run to measure execution time and result quality. The search time was very
good, with a speedup of more than 20 times over parallelized BLAST. The result
quality was good, with relevant alignments found, and it can be improved by
changing the search parameters. Thus, by combining indexing techniques with
the three parallelization techniques and the option to tune the search
configuration, Genoogle proved to be an effective tool whose results have good
quality.

The main contribution of this work is the first tool in the literature to
search genetic sequences using both indexing and parallel computing. Given
that the importance of parallel computing has increased with the growing
number of processor cores, and that data indexing is important for optimizing
searches in data banks that grow exponentially, this work is relevant for
addressing the two issues together.

The software is available at genoogle.pih.bio.br and there is a demonstration
page at pih.bio.br:8080.